Auto-Join: Joining Tables by Leveraging Transformations
Abstract
Traditional equi-join relies solely on string equality comparisons to perform joins. However, in scenarios such as ad hoc data analysis in spreadsheets, users increasingly need to join tables whose join columns are from the same semantic domain but use different textual representations, so transformations are needed before an equi-join can be performed. We developed Auto-Join, a system that automatically searches over a rich space of operators to compose a transformation program whose execution makes the input tables equi-join-able. We developed an optimal sampling strategy that allows Auto-Join to scale to large datasets efficiently while ensuring that joins succeed with high probability. Our evaluation, using real test cases collected from both public web tables and proprietary enterprise tables, shows that the proposed system performs the desired transformation joins efficiently and with high quality.
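To make the idea of a transformation join concrete, the following is a minimal sketch in Python. It does not reproduce Auto-Join's operator search or its sampling strategy; the transformation program here is written by hand, and the names `transform` and `transformation_join` as well as the sample rows are assumptions introduced purely for illustration.

```python
def transform(key: str) -> str:
    """Hypothetical hand-written program: 'Last, First' -> 'First Last'."""
    last, first = key.split(", ", 1)
    return f"{first} {last}"


def transformation_join(left, right, program):
    """Equi-join (key, value) rows after applying `program` to the left keys."""
    # Index the right table by its join key for constant-time lookups.
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)

    joined = []
    for key, value in left:
        # The transformed left key must equal a right key exactly.
        for match in index.get(program(key), []):
            joined.append((key, value, match))
    return joined


# Illustrative rows only: the two tables spell the same names differently.
left_table = [("Obama, Barack", 1961), ("Bush, George W.", 1946)]
right_table = [("Barack Obama", "Democratic"), ("George W. Bush", "Republican")]

rows = transformation_join(left_table, right_table, transform)
for row in rows:
    print(row)
```

A plain equi-join on these two tables would return no rows, because no key strings are equal. Once the transformation is applied to the left keys, every row finds its match. Per the abstract, the system's contribution is discovering such a program automatically instead of requiring the user to write it.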
Similar resources
Hash-based Symmetric Data Structure and Join Algorithm for OLAP Applications
Star schema is often used in dimensional approaches applied to OLAP applications. The fact table in the star schema typically contains a huge amount of data. When some of the dimension tables are also very large, it may take too much time and storage to join the fact table with these dimension tables. The performance of the join algorithm becomes critical under such a condition. The fluent join is a ...
Neighbor Table Construction and Update in a Dynamic Peer-to-Peer Network
In a system proposed by Plaxton, Rajaraman and Richa (PRR), the expected cost of accessing a replicated object was proved to be asymptotically optimal for a static set of nodes and pre-existence of consistent and optimal neighbor tables in nodes [9]. To implement PRR’s hypercube routing scheme in a dynamic, distributed environment, such as the Internet, various protocols are needed (for node jo...
AdaptDB: Adaptive Partitioning for Distributed Joins
Big data analytics often involves complex join queries over two or more tables. Such join processing is expensive in a distributed setting both because large amounts of data must be read from disk, and because of data shuffling across the network. Many techniques based on data partitioning have been proposed to reduce the amount of data that must be accessed, often focusing on finding the best ...
A Visual Introduction to PROC SQL Joins.PDF
Real systems rarely store all their data in one large table. To do so would require maintaining several duplicate copies of the same values and could threaten the integrity of the data. Instead, IT departments everywhere almost always divide their data among several different tables. Because of this, a method is needed to simultaneously access two or more tables to help answer the interesting q...
SEMA-JOIN: Joining Semantically-Related Tables Using Big Table Corpora
Join is a powerful operator that combines records from two or more tables, and it is of fundamental importance in the field of relational databases. However, traditional join processing mostly relies on string equality comparisons. Given the growing demand for ad hoc data analysis, we have seen an increasing number of scenarios where the desired join relationship is not equi-join. For example, in ...
Journal title: PVLDB
Volume: 10
Publication date: 2017